According to the documentation (https://search.r-project.org/CRAN/refmans/spData/html/boston.html), this dataset contains housing data collected as part of the 1970 census of Boston, Massachusetts. The corrected data from Harrison and Rubinfeld (1978) are contained in a data frame with 506 rows and 20 columns. Each observation (row) contains a collection of statistics for a single census ‘tract’ (a small geographic region containing multiple houses, defined specifically for a census). Note that MEDV is censored: median values at or above USD 50,000 are set to USD 50,000.
In this project we consider the spatial distribution of the CMEDV variable. This variable is the median value (in thousands of USD) of owner-occupied housing in each census tract. Each tract is also associated with a point location: the geographic coordinates of this point (in decimal degrees of latitude and longitude) and the town in which it is located (within the Greater Boston area) are provided for each observation.
We are going to derive a smaller dataframe from the above data set that contains only the variables TOWN, LON, LAT and CMEDV:
| TOWN | LON | LAT | CMEDV |
|---|---|---|---|
| Nahant | -70.96 | 42.26 | 24.0 |
| Swampscott | -70.95 | 42.29 | 21.6 |
| Swampscott | -70.94 | 42.28 | 34.7 |
| Marblehead | -70.93 | 42.29 | 33.4 |
| Marblehead | -70.92 | 42.30 | 36.2 |
| | x |
|---|---|
| TOWN | 0 |
| LON | 0 |
| LAT | 0 |
| CMEDV | 0 |
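The two tables above can be reproduced along the following lines. This is a minimal sketch in base R on a toy stand-in for the full data set: the object names `full`, `keep`, `small` and `na_counts` are hypothetical, the `CRIM` values are illustrative, and it assumes that the second table reports the number of missing values per column.

```r
# Toy stand-in for the full data set; CRIM represents the columns we drop
# and its values are illustrative only.
full <- data.frame(
  TOWN  = c("Nahant", "Swampscott", "Swampscott"),
  LON   = c(-70.96, -70.95, -70.94),
  LAT   = c(42.26, 42.29, 42.28),
  CMEDV = c(24.0, 21.6, 34.7),
  CRIM  = c(0.006, 0.027, 0.027)
)

# Keep only the four variables used in the rest of the analysis
keep  <- c("TOWN", "LON", "LAT", "CMEDV")
small <- full[, keep]

# Number of missing values in each remaining column
na_counts <- colSums(is.na(small))
print(na_counts)
```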
Coordinates
Next, we plot the provided coordinates on a map. The points given by the latitudes and longitudes do not match the towns on the map.
## Assuming "lon" and "lat" are longitude and latitude, respectively
The map below shows a closer view of the coordinates. We can observe that some of the points appear to lie on the water.
## Assuming "lon" and "lat" are longitude and latitude, respectively
Finally, we choose a single town and plot both its recorded (wrong) and its true coordinates on the map in order to assess how to correct the coordinates.
## Assuming "LON" and "LAT" are longitude and latitude, respectively
In order to correct the data, we suppose that all coordinates are shifted by a certain amount. We assume that there are \(n_j\) observations in town \(j\), and for each observation \(k\) in town \(j\), we denote the longitudinal coordinate as \(x_{j,k}\), \(k = 1,\dots, n_j\). Then we assume:
\[ x_{j,k}=TC^{(x)}_j+\Delta^{(x)}_{j,k}\] where \(TC^{(x)}_j\) is the longitudinal coordinate of the center of town \(j\), and \(\Delta^{(x)}_{j,k}\) is the displacement of observation \(k\) in town \(j\) from the town center. We also assume that the latitudinal coordinates (which we denote \(y_{j,k}\)) satisfy a similar relationship. The suggested systematic error is therefore that \((TC^{(x)}_j ,TC^{(y)}_j)\) has been misspecified for \(j = 1, \dots, n\), where \(n\) is the number of towns.
To find the displacement, we use the correct center coordinates for each town in the Greater Boston area, which are provided in the file BostonTownCentres.csv. First, we take a quick look at the data.
## Rows: 92 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): town
## dbl (2): lat, lon
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
| town | lat | lon |
|---|---|---|
| Arlington | 42.41537 | -71.15644 |
| Ashland | 42.26066 | -71.46413 |
| Bedford | 42.49173 | -71.28179 |
| Belmont | 42.39593 | -71.17867 |
| Beverly | 42.55843 | -70.88005 |
Next, we use a mutating join to combine the two data
sets. We check and find that the number of rows in
boston.c does not match the number of rows in the new
data frame. The missing data correspond to a town named
Saugus, which is spelled Sargus in the original dataset. We therefore
correct the instances of Sargus and join the original data frame with
BostonTownCentres.csv again.
#Join data frames
join.coord<-centre.coord %>% left_join(BostonData, by=c('town'='TOWN'))
#Check number of rows match
nrow(join.coord)==nrow(BostonData)
## [1] FALSE
##Find the town that's missing
setdiff(unique(BostonData$TOWN), unique(join.coord$town))
## [1] "Sargus"
#Empty dataframe to avoid duplicates
join.coord<-NA
##Correct missing values
BostonData$TOWN[BostonData$TOWN=='Sargus']<-'Saugus'
#Join correct data frames
join.coord<-centre.coord %>% left_join(BostonData, by=c('town'='TOWN'))
nrow(join.coord)==nrow(BostonData)
## [1] TRUE
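A mismatch like this can also be caught before joining by comparing the key columns in both directions. The following is a small base R sketch on hypothetical toy keys; `data_towns` and `centre_towns` stand in for `BostonData$TOWN` and `centre.coord$town` and are not the real data.

```r
# Hypothetical toy keys standing in for BostonData$TOWN and centre.coord$town
data_towns   <- c("Nahant", "Sargus", "Sargus", "Lynn")
centre_towns <- c("Lynn", "Nahant", "Saugus")

# Towns in the data with no matching centre, and centres with no data
unmatched_data    <- setdiff(unique(data_towns), centre_towns)
unmatched_centres <- setdiff(centre_towns, unique(data_towns))
print(unmatched_data)
print(unmatched_centres)

# After correcting the spelling, no town should be left unmatched
data_towns[data_towns == "Sargus"] <- "Saugus"
stopifnot(length(setdiff(data_towns, centre_towns)) == 0)
```

Checking both directions shows the misspelled name and the correct one side by side, which makes it easier to recognise a typo than a plain row-count comparison.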
The map below shows the correct coordinates. We can already observe that no points are projected onto the water and that the towns on the map and in the legend appear to match.
## Assuming "lon" and "lat" are longitude and latitude, respectively
The map below shows a closer view of the coordinates.
## Assuming "lon" and "lat" are longitude and latitude, respectively
In order to fix our data set, we need to replace the centroid of the \(n_j\) boston.c locations for each town (i.e. for \(j = 1,\dots,n\)) with the true town center. First, we find the centroid in our dataset by grouping the data by town and taking the mean longitude and latitude. Then we calculate the displacement as follows: \[x_{j,k}=TC^{(x)}_j+\Delta^{(x)}_{j,k} \Rightarrow \Delta^{(x)}_{j,k}=x_{j,k}-TC^{(x)}_j\] In the equation above, \(x_{j,k}\) is known (it is the coordinate recorded in boston.c) and \(TC^{(x)}_j\) is estimated by the mean longitude of town \(j\); the latitude is handled in the same way. Finally, we add the displacement of each observation to the true town center contained in BostonTownCentres.csv and create a new data frame containing two columns with the corrected coordinates for each observation.
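Written out, this correction amounts to translating all points of a town by the same vector. As a sketch, let \(\overline{x}_j\) denote the centroid of the recorded longitudes in town \(j\), \(\widetilde{TC}^{(x)}_j\) the true center from BostonTownCentres.csv, and \(\tilde{x}_{j,k}\) the corrected longitude (these three symbols are notation introduced here for illustration):
\[ \overline{x}_j=\frac{1}{n_j}\sum_{k=1}^{n_j}x_{j,k}, \qquad \Delta^{(x)}_{j,k}=x_{j,k}-\overline{x}_j \]
\[ \tilde{x}_{j,k}=\widetilde{TC}^{(x)}_j+\Delta^{(x)}_{j,k}=x_{j,k}+\left(\widetilde{TC}^{(x)}_j-\overline{x}_j\right) \]
Since the displacements within a town sum to zero, the centroid of the corrected points is exactly \(\widetilde{TC}^{(x)}_j\), while the relative positions of the tracts within the town are unchanged. The same holds for the latitudes \(y_{j,k}\).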
#Calculate the centroid in old data set
centroid<-BostonData %>% group_by(TOWN) %>% summarise(centre_lon=mean(LON),centre_lat=mean(LAT))
#Empty data frame for the corrected lon-lat
new_cord<-data.frame(cor_lon=numeric(),cor_lat=numeric())
##Loop through all names in centroid
for (name in centroid$TOWN){
  #Create a temporary data frame from our data containing the lon and lats of the town equal to name
  temp<-BostonData %>% filter(TOWN==name)
  #Create temporary data frames containing the wrong and correct centroids of the town equal to name
  temp.centre<-centroid %>% filter(TOWN==name)
  cor.centroid<-centre.coord %>% filter(town==name)
  #Calculate displacement for both lon-lat
  dislon<-temp$LON-temp.centre$centre_lon
  dislat<-temp$LAT-temp.centre$centre_lat
  #Calculate the right coordinates
  cor_lon<-cor.centroid$lon+dislon
  cor_lat<-cor.centroid$lat+dislat
  #Add the right coordinates to our new dataframe
  new_cord<-rbind(new_cord, data.frame(cor_lon,cor_lat))
}
#Combine the data frames; cbind() matches rows by position, so this assumes
#join.coord lists the towns in the same order as centroid$TOWN
join.coord<-cbind(join.coord,new_cord)
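The loop above can also be written without a loop, which avoids relying on the row order of two separately built data frames. The following base R sketch is an alternative, not the method used above; the toy data frames `obs` and `centres` are hypothetical stand-ins for `BostonData` and `centre.coord`, with illustrative values.

```r
# Toy observations: two towns with illustrative recorded coordinates
obs <- data.frame(
  TOWN = c("A", "B", "A", "B", "B"),
  LON  = c(-71.00, -70.50, -71.02, -70.52, -70.54),
  LAT  = c(42.00, 42.50, 42.04, 42.52, 42.54)
)
# True town centres (stand-in for BostonTownCentres.csv), in a different order
centres <- data.frame(
  town = c("B", "A"),
  lon  = c(-70.60, -71.10),
  lat  = c(42.40, 42.10)
)

# Recorded centroid of each observation's town, aligned row by row
cen_lon <- ave(obs$LON, obs$TOWN, FUN = mean)
cen_lat <- ave(obs$LAT, obs$TOWN, FUN = mean)

# Row of the true centre that belongs to each observation's town
idx <- match(obs$TOWN, centres$town)

# Corrected coordinate = true centre + displacement from recorded centroid
obs$cor_lon <- centres$lon[idx] + (obs$LON - cen_lon)
obs$cor_lat <- centres$lat[idx] + (obs$LAT - cen_lat)
print(obs)
```

Because the corrected columns are computed on the rows of `obs` itself, each corrected coordinate stays attached to its own observation regardless of how the towns are ordered in either table.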
The final map can be seen below.
Finally, we construct a visualisation that shows the spatial distribution of the median value of owner-occupied housing in Greater Boston in 1970. In this instance, we use ggmap. We observe that some towns have only one observation, so we cannot create polygons for them.
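Which towns are affected can be checked by counting the observations per town, since fewer than three points cannot define a polygon. A minimal base R sketch, using a hypothetical toy vector `towns` in place of the real TOWN column:

```r
# Hypothetical toy vector standing in for the TOWN column
towns <- c("Nahant", "Swampscott", "Swampscott",
           "Marblehead", "Marblehead", "Marblehead")

# Number of observations (tract points) per town
n_per_town <- table(towns)

# Towns with fewer than three points cannot be drawn as a polygon
too_few <- names(n_per_town)[n_per_town < 3]
print(too_few)
```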
## Source : https://maps.googleapis.com/maps/api/staticmap?center=42.36008,-71.05888&zoom=10&size=640x640&scale=2&maptype=terrain&key=xxx-0NQyKizPR9jdAYCfTiyB5IhVfbdU2xI
The resulting map lacks visual appeal. Another strategy would be to use the corrected coordinates to complete the visualisation in Tableau. In fact, Tableau automatically matches some coordinates with the names of the towns, which would have simplified the process.